In this brief tutorial, we will review some basic ideas about smoothing and start thinking through how we can express these ideas mathematically and in R.
Tobler’s profound, but deceptively simple, first law states that:
“Everything is related to everything else. But near things are more related than distant things.” (Tobler 1970)
He applied this idea in developing a dynamic model of urban growth in the Detroit region, which assumed that rates of population growth were spatially similar.
In this post, we’re going to start with a simpler problem - change in the value of a function in one dimension - to see how we can translate the concept of distance decay implied by Tobler’s first law (TFL) into a useful model. To begin, we’re going to keep it simple with no noise or observation error and just interpolate some values.
In this example, we are interested in visualizing and predicting the values of a function \(f\) that maps each input \(x_i\) to an output \(y_i\), the expected value at that point:
\[ y_i = f(x_i) \]
Let's start by getting the values of \(f(x)\) for every input value \(x\). For simplicity, we will assume that \(f(x)\) is a sine function and that the values of \(x\) go from -1 to +5, allowing us to observe nearly one full cycle of the sine function:
x <- seq(-1, 5, by = 0.1)
y <- sin(x)
plot(x, y)
You can see right away that this simple curve pretty neatly expresses Tobler’s first law: the values of \(f(x)\) are in general more similar to each other for nearby values of \(x\). If we want to press this idea into real-world practice, we need a model that can translate TFL into quantitative estimates and qualitative visualizations. There are lots of ways to do this, but we’ll focus on locally-weighted regression, also known as LOWESS.
The basic idea of LOWESS regression is to define a window of the \(k\) points nearest each value we wish to estimate, and calculate a weighted average of the values of those points, which can then be used as the estimated value \(\hat{y}_j \approx f(x_j)\). The weights come from a weight function \(w(x)\) applied to these nearest neighbors.
These weight functions can take a lot of different forms, but we’ll start simple with a uniform one, i.e. just taking the average of the \(k\) nearest neighbors, so that \(\hat{y}_i = \frac{1}{k} \sum z(x_i, k)\), where \(z(x_i, k)\) is a function returning the \(y\) values of the \(k\) nearest observations to \(x_i\). The value of \(k\) is sometimes referred to as the bandwidth of the smoothing function: larger bandwidths use more data to estimate the value at each point, smaller ones use less.
Using the FNN package for R, we can find the indices of the \(k\) nearest neighbors of each point at which we want to make an estimate:
library(FNN)
k <- 10
z <- knn.index(x, k = k)
You can read the output of this function, below, as indicating the indices \(i\) of the 10 nearest points to each of the values of \(x\).
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 3 4 5 6 7 8 9 10 11
[2,] 1 3 4 5 6 7 8 9 10 11
[3,] 2 4 1 5 6 7 8 9 10 11
[4,] 5 3 6 2 1 7 8 9 10 11
[5,] 6 4 7 3 8 2 1 9 10 11
[6,] 5 7 4 8 3 9 2 10 1 11
[7,] 8 6 9 5 10 4 11 3 12 2
[8,] 7 9 10 6 11 5 4 12 3 13
[9,] 10 8 11 7 12 6 5 13 14 4
[10,] 9 11 8 12 7 13 14 6 5 15
We can visualize this by picking a point in the middle of the series, finding its 10 nearest neighbors, and taking their average to obtain the estimated value \(\hat{y}_i\):
library(ggplot2)
## Plot the original data
g <- ggplot() +
geom_point(aes(x = x, y = y)) +
xlab("x") +
ylab("f(x)")
## Now, get the index for x = 2
x_index <- which(x == 2)
## Show the range of k nearest neighbors of this point
knn_low <- min(x[z[x_index, ]])
knn_high <- max(x[z[x_index, ]])
y_hat <- mean(y[z[x_index, ]])
## Add errorbars to the figure to show the 10 nearest values with the height of the point indicating the estimated value at y_i, denoted by the red dot
g <- g + geom_errorbarh(aes(xmin = knn_low, xmax = knn_high, y = y_hat)) +
  geom_point(aes(x = 2, y = y_hat), colour = "red")
plot(g)
Notice that if the knn function is applied at the low end of the series, i.e. to the first value, it will use points to the right of that one instead of to either side:
library(ggplot2)
## Plot the original data
g <- ggplot() +
geom_point(aes(x = x, y = y)) +
xlab("x") +
ylab("f(x)")
## Use the index for the lowest value
x_index <- 1
## Show the range of k nearest neighbors of this point
knn_low <- min(x[z[x_index, ]])
knn_high <- max(x[z[x_index, ]])
y_hat <- mean(y[z[x_index, ]])
## Add errorbars to the figure to show the 10 nearest values with the height of the point indicating the estimated value at y_i
g <- g + geom_errorbarh(aes(xmin = knn_low, xmax = knn_high, y = y_hat)) +
  geom_point(aes(x = x[x_index], y = y_hat), colour = "red")
plot(g)
Now, let's see what happens if we run our smoother over the whole series, taking the average of the 10 nearest points for each value, and compare the results to the observed data:
y_hat <- rep(0, length(x))
for (i in 1:length(x)) {
  y_hat[i] <- mean(y[z[i, ]])
}
Now plot the predicted vs. the observed values:
g <- ggplot() +
geom_point(aes(x = x, y = y)) +
xlab("x") +
ylab("f(x)") +
geom_line(aes(x = x, y = y_hat), colour = "red")
plot(g)
You can see this does a pretty good job all the way through, except at the edges. Let's try it again with a smaller window - or bandwidth - of 5 and see what happens. First, we'll write a function that gives us the predicted value of y at each point, given a window of size k and the input values:
knn_est <- function(x, y, k) {
z <- knn.index(x, k = k)
y_hat <- rep(0, length(x))
for (i in 1:length(x)) {
y_hat[i] <- mean(y[z[i, ]])
}
df <- data.frame(x = x, y = y, yhat = y_hat)
return(df)
}
pred_df <- knn_est(x, y, 5)
g <- ggplot(pred_df) +
geom_point(aes(x = x, y = y)) +
geom_line(aes(x = x, y = yhat), colour = "red")
plot(g)
This gets rid of a lot of the weird effects at the edges but introduces some noise into the function. What if we make the window bigger, say 20, to get rid of some of the noise?
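The chunk that produced the wider-window fit isn't shown here, so below is a sketch using a base-R stand-in for the FNN-based knn_est (named knn_est_base, a hypothetical helper) so it runs without the FNN package:

```r
# Base-R stand-in for knn_est with k = 20 (no FNN dependency);
# order() finds each point's k nearest neighbours by absolute distance
x <- seq(-1, 5, by = 0.1)
y <- sin(x)
knn_est_base <- function(x, y, k) {
  y_hat <- sapply(seq_along(x), function(i) {
    nn <- order(abs(x - x[i]))[2:(k + 1)]  # exclude the point itself
    mean(y[nn])
  })
  data.frame(x = x, y = y, yhat = y_hat)
}
pred_df <- knn_est_base(x, y, 20)
```

Plotting pred_df as in the k = 5 example shows the wider window oversmoothing the curve, especially near the ends of the series.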

This seems to make the edge effects worse, and it makes the estimates of the function worse overall.
What happens if we go in the opposite direction and shrink the window down to 2?
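Again, the original chunk isn't shown; a self-contained base-R sketch of the same neighbour-averaging idea (the helper knn_avg is a hypothetical stand-in for the FNN-based code above) is:

```r
# Shrink the window to k = 2: average only the two nearest neighbours
x <- seq(-1, 5, by = 0.1)
y <- sin(x)
knn_avg <- function(x, y, k) {
  sapply(seq_along(x), function(i) {
    nn <- order(abs(x - x[i]))[2:(k + 1)]  # k nearest, excluding point i
    mean(y[nn])
  })
}
y_hat2 <- knn_avg(x, y, 2)
```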

Why does this appear to be more accurate for these data than \(k=10\) and \(k=5\)?
What would happen if we added observation noise to the values of \(y_i\)? Which one of the smoothers do you think would work better then?
Is there one best value of \(k\) for all datasets? How might you go about picking the best one?
How does our uniform weight function express Tobler’s first law? What kind of weight function \(w(x)\) might do a better job of capturing the notion of distance decay?
When working with spatial data, our analytic task often falls into one of two rough categories:
Estimating local averages of a continuous variable. This could be something like neighborhood-by-neighborhood variation in average blood pressure, or the intensity of some environmental risk factor, such as fine (e.g. \(PM_{2.5}\)) dust, which can cause respiratory illness.
Estimating the local density or incidence of a particular outcome. For example, if we are trying to understand spatial variation in the incidence of a particular disease, we are interested in knowing how many cases of that disease are present in a given location or are likely to be present in some nearby unobserved location.
In this tutorial, we’ll dig into the problem of spatial density estimation in one dimension along a spatial transect. The techniques discussed here form the basis of a number of approaches for cluster or hotspot analysis. For examples of smoothing local values of a continuous variable, see ?sec-smoothing.
A spatial transect is an area of space along a line crossing a landscape. These are often used in ecology and forestry to assess the health of an environment, species diversity and other factors. Using a transect can help simplify the problem of spatial analysis down to one dimension rather than the usual two, while still providing a tremendous amount of useful information.

For example, Levy et al. (2014) were interested in characterizing the intensity of exposure to triatomine bugs and other insect vectors of the pathogen T. cruzi, which causes Chagas disease.


Imagine we are estimating the density of some unknown insect vector along a 1 kilometer transect with the goal of characterizing the risk of infection with a vector-borne illness.
Much like in our discussion of kernel smoothing of continuous outcomes, kernel functions play a key role in this setting as well. In this case, imagine that the locations of vectors along our transect have been sampled at random from some unknown function \(f(x)\) which takes values from 0 (the beginning of the transect) to 1000m (the end).
We can use the kernel function \(K(d)\) to approximate the intensity of the outcome of interest at each observed case location \(x_i\). Imagine that our observed data have locations \(x_1, x_2, \ldots, x_n\) and that the distance between our point of interest, \(x_j\), and each observed point is \(d_{ij} = | x_j - x_i |\).
Finally, let's include a bandwidth parameter, \(h\), which controls the width of the window we will use for smoothing. Putting this all together, we can estimate the density of our outcome of interest at location \(x_j\) as follows:
\[ \hat{f}_h(x_j) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x_j - x_i}{h}\right) \]
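To make the estimator concrete, here is a small worked example with three hypothetical observation locations (the values of xi, h, and xj are made up for illustration):

```r
# Three hypothetical vector locations along the transect (metres)
xi <- c(100, 120, 500)
h  <- 50    # bandwidth
xj <- 110   # location where we want the density estimate
# Gaussian kernel density estimate at xj: K(u) = dnorm(u)
f_hat <- (1 / (length(xi) * h)) * sum(dnorm((xj - xi) / h))
```

The two nearby points at 100m and 120m dominate the estimate; the point at 500m contributes essentially nothing at this bandwidth.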
As you can see below, we can pick from a range of kernel functions, but for the sake of simplicity, in this example we will focus on a Gaussian, or normal, kernel, which uses the probability density function of a normal distribution to weight points.
Let's start by sampling locations of observed points along a one-dimensional line. To keep things interesting, we’ll use a mixture distribution with two components:

First, let's imagine a scenario in which the risk of observing an insect vector steadily decreases as we walk along our transect. However, along the way there is a hotspot of increased risk, beyond what we would expect from the smooth decline before and after that spot. For the purposes of this example, we’ll assume that risk decays exponentially with distance from the origin, but that our hotspot is centered at a point 300 meters into the transect. The code below lets us sample the locations of the points along the transect where 🐜 are observed from two distributions:
An exponential distribution representing smooth decay from the beginning to the end of the transect, and
A normal distribution representing a hotspot roughly 150m wide centered 300m in
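The sampling chunk isn't shown in the rendered text; one way to draw the locations, with assumed values for the mixture weight p_hot and sample size n (neither appears in the text), is:

```r
set.seed(42)            # for reproducibility
n <- 1000
p_hot <- 0.2            # assumed share of points from the hotspot
from_hot <- runif(n) < p_hot
x <- ifelse(from_hot,
            rnorm(n, mean = 300, sd = 50),  # hotspot centred 300m in
            rexp(n, rate = 1 / 250))        # smooth decay from the origin
x <- x[x >= 0 & x <= 1000]                  # keep locations on the transect
```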
The figure below shows a histogram of locations sampled from \(f(x)\) (vertical bars) overlaid with the true value of \(f(x)\) in red:
library(ggplot2)
d_a <- dexp(1:1000, rate = 1 / 250)
d_b <- dnorm(1:1000, mean = 300, sd = 50)
y <- (1 - p_hot) * d_a + p_hot * d_b
dens_df <- data.frame(x = 1:1000, y = y)
xdf <- data.frame(x = x)
g <- ggplot(xdf) +
geom_histogram(aes(x = x, y = after_stat(density)), bins = 100) +
geom_line(data = dens_df, aes(x = x, y = y), colour = "red") +
xlim(0, 1000) +
ylab("Density") +
xlab("Distance from transect origin (m)")
plot(g)
Now, imagine we have another set of finely spaced points along the line, and we want to estimate the density at each one. The function below lets us do that:
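The smoothing function itself isn't shown in the rendered output, so this is one plausible sketch consistent with how normal_smoother is called below (the evaluation grid is assumed, and the kernp argument used later is absorbed via ... since its role isn't visible here):

```r
# Sketch of a kernel density smoother evaluated on a grid of points;
# the default kernel is a Gaussian density scaled by the bandwidth
normal_smoother <- function(x, h,
                            kern = function(d, bw) dnorm(d / bw) / bw,
                            grid = 0:1000, ...) {
  y <- sapply(grid, function(g) mean(kern(g - x, bw = h)))
  data.frame(x = grid, y = y)
}
```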
The figure below shows the true value of our density function \(f(x)\) in red, the locations of the simulated points as a ‘rug plot’ along the x-axis, and our smoothed density in black, for a bandwidth of \(h=10\):
library(ggplot2)
pred_df <- normal_smoother(x, h = 10)
g <- ggplot() +
geom_rug(aes(x = x)) +
geom_line(data = pred_df, aes(x = x, y = y)) +
ylab("Density") +
geom_line(data = dens_df, aes(x = x, y = y), colour = "red") +
xlim(0, 1000)
plot(g)
Now, let's see what happens if we try this for different values of \(h\):
all_df <- data.frame()
for (hv in c(5, 10, 20, 50, 100, 250)) {
pred_df <- normal_smoother(x, h = hv)
pred_df$h <- hv
all_df <- rbind(all_df, pred_df)
}
all_df$h <- as.factor(all_df$h)
g <- ggplot(all_df) +
geom_rug(aes(x = x)) +
geom_line(data = dens_df, aes(x = x, y = y), colour = "red") +
geom_line(aes(x = x, y = y)) +
ylab("Density") +
facet_wrap(~h) +
xlim(0, 1000)
plot(g)
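The kernel density functions used in the next chunk (kdgaussian, kduniform, kdtricube, kdtriangular) aren't defined in the text; here is one plausible set of definitions, each scaled to integrate to one over its support (the kp* helpers passed as kernp are also not shown and aren't sketched here):

```r
# Each kernel takes distances d and bandwidth bw and returns weights
# that integrate to 1 over the kernel's support
kdgaussian <- function(d, bw) dnorm(d / bw) / bw
kduniform <- function(d, bw) ifelse(abs(d) <= bw, 1 / (2 * bw), 0)
kdtriangular <- function(d, bw) ifelse(abs(d) <= bw, (1 - abs(d) / bw) / bw, 0)
kdtricube <- function(d, bw) {
  u <- abs(d / bw)
  ifelse(u <= 1, (70 / 81) * (1 - u^3)^3 / bw, 0)
}
```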
all_df <- data.frame()
hvals <- seq(1, 100, by = 2)
distvals <- seq(-100, 100, by = 1)
all_kernvals <- data.frame()
for (hv in hvals) {
pred_df <- normal_smoother(x, h = hv)
pred_df$h <- hv
pred_df$smoother <- "gaussian"
all_df <- rbind(all_df, pred_df)
all_kernvals <- rbind(all_kernvals, data.frame(x = distvals, y = kdgaussian(distvals, bw = hv), smoother = "gaussian", h = hv))
pred_df <- normal_smoother(x, h = hv, kern = kduniform, kernp = kpuniform)
pred_df$h <- hv
pred_df$smoother <- "uniform"
all_df <- rbind(all_df, pred_df)
all_kernvals <- rbind(all_kernvals, data.frame(x = distvals, y = kduniform(distvals, bw = hv), smoother = "uniform", h = hv))
pred_df <- normal_smoother(x, h = hv, kern = kdtricube, kernp = kptricube)
pred_df$h <- hv
pred_df$smoother <- "tricube"
all_df <- rbind(all_df, pred_df)
all_kernvals <- rbind(all_kernvals, data.frame(x = distvals, y = kdtricube(distvals, bw = hv), smoother = "tricube", h = hv))
pred_df <- normal_smoother(x, h = hv, kern = kdtriangular, kernp = kptriangular)
pred_df$h <- hv
pred_df$smoother <- "triangular"
all_df <- rbind(all_df, pred_df)
all_kernvals <- rbind(all_kernvals, data.frame(x = distvals, y = kdtriangular(distvals, bw = hv), smoother = "triangular", h = hv))
}
You can adjust the range of the bandwidth here to get a better sense of the relationship between the smoothed curve (black) and the true density (red). Adjust the bin width for the histogram of the underlying data to get a sense of the fit of the model to the underlying data.
The figure below shows the relative amount of weight placed on different points as a function of their distance from the point of interest (0, marked by the vertical red line):
Which of the bandwidth options seems to do the best job in capturing the value of \(f(x)\)? Why?
How does the choice of kernel impact the smoothing?
How do the different kernel functions encode different assumptions about distance decay?
What is the relationship between the histogram of the data and the smoother? What do you see as you change the histogram bin width relative to the smoothing bandwidth?
Please see Matthew Conlen’s excellent interactive KDE tutorial.